This script is organized as follows:

1) Data cleaning

2) Data exploration

3) Statistical testing

3.1) Stationary test

3.2) ACF/PACF plots

4) Model building

5) Forecasting

NOTE: For a quick overview, skip to the end and read the summary

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd

import datetime

import matplotlib.pyplot as plt
import pylab
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 15, 6

import statsmodels.api as sm  
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import adfuller, acf, pacf, q_stat
from statsmodels.graphics.gofplots import qqplot

from sklearn.metrics import mean_squared_error
from math import sqrt

Step 1) Data cleaning

- The Date column is not in date format

- Some dates are missing, e.g. category L only has about half of the dates

- Note: we are going to forecast 90 days ahead, so our basic unit is a day, not an hour. We will therefore aggregate the data into daily sales
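A minimal sketch of these cleaning steps on a toy frame (the column names mirror the real data; the values are invented): aggregate to daily totals, then reindex onto the full calendar so missing days appear as 0.

```python
import pandas as pd

# Toy hourly-style records with a gap on 2016-03-02
toy = pd.DataFrame({
    'Date_Name': pd.to_datetime(['2016-03-01', '2016-03-01', '2016-03-03']),
    'sales': [100.0, 50.0, 75.0],
})

# Aggregate to daily totals
daily = toy.groupby('Date_Name')['sales'].sum()

# Reindex onto the full calendar, filling missing days with 0
idx = pd.date_range('2016-03-01', '2016-03-03')
daily = daily.reindex(idx, fill_value=0)
print(daily.tolist())  # [150.0, 0.0, 75.0]
```

The same `groupby` + `reindex(..., fill_value=0)` pattern is what the real notebook applies per category below.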

In [2]:
np.random.seed(42)
data = pd.read_csv('../data/siteA.csv')
In [3]:
data['Date_Name'] = pd.to_datetime(data['Date_Name'], format='%Y-%m-%d')
In [4]:
print data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 66160 entries, 0 to 66159
Data columns (total 4 columns):
Date_Name        66160 non-null datetime64[ns]
subCategory      66160 non-null object
Purchase_Hour    66160 non-null int64
sales            66160 non-null float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 2.0+ MB
None
In [5]:
# Make a dataframe for each category and collect them into a list (len(list) == 16).
# Essentially a list of dataframes.

def make_dataframe(df, lis):
    f = lambda x: df[df['subCategory'] == x]
    return map(f, lis)

temp_list = data['subCategory'].unique()
splitted_data = make_dataframe(data, temp_list)
print len(splitted_data)
print splitted_data[2][0:5]
16
      Date_Name subCategory  Purchase_Hour   sales
9259 2016-03-01           C              7  460.34
9260 2016-03-04           C              7  414.03
9261 2016-03-07           C              7  370.37
9262 2016-03-08           C              7  368.68
9263 2016-03-11           C              7  439.15
In [6]:
# Looks like some categories have missing dates; category L has only about half of the dates.

idx = pd.date_range('09-26-2015', '09-30-2016')

for i, item in enumerate(temp_list):
    splitted_data[i] = splitted_data[i].groupby(by=['Date_Name'])['sales'].sum()
    print item, ':', splitted_data[i].shape
    
    # We fill the missing dates with 0, since these may be holidays on which the shop made no sales.
    splitted_data[i].index = pd.DatetimeIndex(splitted_data[i].index)
    splitted_data[i] = splitted_data[i].reindex(idx, fill_value=0)
    print item, ':', splitted_data[i].shape, '-', splitted_data[i].index

print '\n', splitted_data[0].head()
A : (345,)
A : (371,) - DatetimeIndex(['2015-09-26', '2015-09-27', '2015-09-28', '2015-09-29',
               '2015-09-30', '2015-10-01', '2015-10-02', '2015-10-03',
               '2015-10-04', '2015-10-05',
               ...
               '2016-09-21', '2016-09-22', '2016-09-23', '2016-09-24',
               '2016-09-25', '2016-09-26', '2016-09-27', '2016-09-28',
               '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', length=371, freq='D')
B : (345,) -> (371,)
C : (345,) -> (371,)
D : (319,) -> (371,)
E : (339,) -> (371,)
F : (345,) -> (371,)
G : (345,) -> (371,)
H : (345,) -> (371,)
I : (330,) -> (371,)
K : (343,) -> (371,)
L : (194,) -> (371,)
M : (345,) -> (371,)
N : (345,) -> (371,)
O : (346,) -> (371,)
P : (300,) -> (371,)
Q : (328,) -> (371,)
[each category is reindexed onto the same 371-day DatetimeIndex shown above; identical index printouts omitted]

2015-09-26     66.52
2015-09-27     98.92
2015-09-28    405.99
2015-09-29    480.46
2015-09-30    443.97
Freq: D, Name: sales, dtype: float64

Step 2) Data exploration

- The shop has no sales during the Christmas season, from the end of December to the beginning of January

- Low sales volume during the summer, May~Sep

- Weekly seasonal effect: weekends usually have lower sales volume than weekdays

- Category M has some anomalies in the summer

- Category L is only sold for about half of the year

- Maybe it's a shop in a college cafeteria? Low student volume during weekends, Christmas, and summer.
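The weekday/weekend effect can be checked numerically as well as visually. A sketch on synthetic data (in the real notebook one would pass a daily series from `splitted_data` instead):

```python
import numpy as np
import pandas as pd

# Synthetic daily series: weekdays sell ~100, weekends ~40
idx = pd.date_range('2015-09-26', '2016-09-30')
rng = np.random.RandomState(0)
sales = pd.Series(np.where(idx.dayofweek < 5, 100.0, 40.0) + rng.randn(len(idx)),
                  index=idx)

# Compare mean sales on weekdays (Mon-Fri) vs weekends (Sat-Sun)
weekday_mean = sales[sales.index.dayofweek < 5].mean()
weekend_mean = sales[sales.index.dayofweek >= 5].mean()
print(weekday_mean > weekend_mean)  # True for this construction
```

`DatetimeIndex.dayofweek` encodes Monday as 0 through Sunday as 6, so `< 5` selects weekdays.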

In [7]:
# Time series as a connected point plot for each category.
# Missing dates (or dates with 0 sales) are visible in the series this way.

fig, axs = plt.subplots(16,1, figsize=(15, 96))
axs = axs.ravel()

for i, item in enumerate(data['subCategory'].unique()):
    axs[i].plot(splitted_data[i].index, splitted_data[i], '-o')
    axs[i].set_title('Category {} Sales'.format(item))
In [8]:
# Seasonal decomposition plots. Let's confirm the weekly seasonality.

def decomposition(timeseries, item):
    decomposition = seasonal_decompose(timeseries, freq=7)
    fig = plt.figure()  
    fig = decomposition.plot()  
    fig.set_size_inches(15, 8)
    fig.suptitle('Category {} - Seasonal Decomposition Plot'.format(item))

_ = map(decomposition, splitted_data, temp_list)
[16 seasonal decomposition figures rendered, one per category; matplotlib warning about more than 20 open figures and figure object reprs omitted]

- We can definitely see the weekly seasonal effect in the graphs above.

Step 3) Statistical Testing

- We will test the time series for stationarity

- Transform them so they become stationary

- Detrend them

- Take care of the weekly seasonality observed previously

Step 3.1) Stationarity Test

In [11]:
# We will use Dickey-Fuller Test to check the stationarity of our time series data.

def test_stationarity(timeseries, item):
    
    # Determine rolling statistics (pd.rolling_mean/pd.rolling_std are deprecated)
    rol_mean = timeseries.rolling(window=12).mean()
    rol_std = timeseries.rolling(window=12).std()

    #Plot rolling statistics:
    original = plt.plot(timeseries, color='blue',label='Original')
    mean_ = plt.plot(rol_mean, color='red', label='Rolling Mean')
    std_ = plt.plot(rol_std, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
    plt.title('Category {} - Rolling Mean & Standard Deviation'.format(item))
    plt.show(block=False)
    
    #Perform Dickey-Fuller test:
    print 'Results of Dickey-Fuller Test:'
    adf_test = adfuller(timeseries, autolag='AIC')
    adf_output = pd.Series(adf_test[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    
    for key,value in adf_test[4].items():
        adf_output['Critical Value (%s)'%key] = value
    print adf_output
In [12]:
_ = map(test_stationarity, splitted_data, temp_list)
Results of Dickey-Fuller Test:
Test Statistic                  -2.295463
p-value                          0.173436
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.428629
p-value                          0.133819
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.466672
p-value                          0.123795
#Lags Used                      15.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.255306
p-value                          0.186788
#Lags Used                       8.000000
Number of Observations Used    362.000000
Critical Value (5%)             -2.869557
Critical Value (1%)             -3.448544
Critical Value (10%)            -2.571041
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.023702
p-value                          0.276278
#Lags Used                      15.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.251972
p-value                          0.187926
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.361139
p-value                          0.153006
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.173112
p-value                          0.216149
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.155512
p-value                          0.222786
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.132693
p-value                          0.231574
#Lags Used                      15.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.164976
p-value                          0.219202
#Lags Used                      10.000000
Number of Observations Used    360.000000
Critical Value (5%)             -2.869602
Critical Value (1%)             -3.448646
Critical Value (10%)            -2.571065
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.717033
p-value                          0.071137
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.408838
p-value                          0.139258
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.493939
p-value                          0.116956
#Lags Used                      14.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.402126
p-value                          0.141137
#Lags Used                       9.000000
Number of Observations Used    361.000000
Critical Value (5%)             -2.869580
Critical Value (1%)             -3.448595
Critical Value (10%)            -2.571053
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -2.230331
p-value                          0.195422
#Lags Used                      17.000000
Number of Observations Used    353.000000
Critical Value (5%)             -2.869763
Critical Value (1%)             -3.449011
Critical Value (10%)            -2.571151
dtype: float64

We can see that most categories have test statistics larger than even the 10% critical value, so we cannot reject the unit-root null hypothesis.

Most categories are therefore not stationary... let's make them stationary!

1) Log it

2) Diff(Log)

3) Season_diff(Diff(Log))
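A minimal end-to-end sketch of this three-step pipeline on a synthetic positive series (`.diff()` is equivalent to the `x - x.shift()` used in the cells below; when forecasting, the steps must be inverted in reverse order: seasonal cumulative sum, cumulative sum, then `np.exp`):

```python
import numpy as np
import pandas as pd

idx = pd.date_range('2015-09-26', periods=100)
rng = np.random.RandomState(1)
# Synthetic positive series with a trend and a weekly pattern
y = pd.Series(50 + np.arange(100) + 10 * (idx.dayofweek < 5) + rng.rand(100),
              index=idx)

y_log = np.log(y)             # 1) stabilize the variance
y_log_diff = y_log.diff()     # 2) first difference to remove the trend
y_final = y_log_diff.diff(7)  # 3) seasonal difference at lag 7 (weekly)
y_final = y_final.dropna()    # the shifts introduce NaNs at the start

# 1 + 7 leading values are lost to the two differencing steps
print(len(y_final))  # 92
```

Each differencing step shortens the usable series by its lag, which is why the NaNs must be dropped before re-running the Dickey-Fuller test.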

In [13]:
# We will use this function repeatedly to make plots for -> Transformed Data vs. Original Data

def plot_data(original, trans_2, item, title, label_1, label_2):
    original.plot(label='{}'.format(label_1))
    trans_2.plot(label='{}'.format(label_2))
    plt.legend(loc='best')
    plt.title('Category {} - {}'.format(item, title))
    plt.show(block=False)
In [27]:
def log_data(timeseries, item):
    # item is unused; kept so the function maps cleanly over (splitted_data, temp_list)
    return np.log(timeseries)

splitted_data_log = map(log_data, splitted_data, temp_list)
print splitted_data[0].head(), '\n', splitted_data_log[0].head()
2015-09-26     66.52
2015-09-27     98.92
2015-09-28    405.99
2015-09-29    480.46
2015-09-30    443.97
Freq: D, Name: sales, dtype: float64 
2015-09-26    4.197503
2015-09-27    4.594311
2015-09-28    6.006329
2015-09-29    6.174744
2015-09-30    6.095757
Freq: D, Name: sales, dtype: float64
/Library/Python/2.7/site-packages/ipykernel/__main__.py:3: RuntimeWarning: divide by zero encountered in log
  app.launch_new_instance()
In [28]:
# Let's see the log plot.

_ = map(plot_data, splitted_data, splitted_data_log, temp_list, \
        ['Log']*len(temp_list), ['original']*len(temp_list), ['logged']*len(temp_list))
In [30]:
# Replace inf, -inf with 0...because log(0) gives -inf

print splitted_data_log[15]['2015-12-20']
replace_inf = lambda x: x.replace([np.inf, -np.inf], 0)
splitted_data_log = map(replace_inf, splitted_data_log)
print splitted_data_log[15]['2015-12-20']
-inf
0.0
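Replacing `-inf` with 0 works, but note it maps a zero-sales day to `log(x) = 0`, the same value a day with 1 unit of sales would get. A possible alternative (a suggestion, not what this notebook does) is `np.log1p`, which is finite at 0 and exactly inverted by `np.expm1`:

```python
import numpy as np

sales = np.array([0.0, 66.52, 405.99])

logged = np.log1p(sales)     # log(1 + x): finite at 0, no inf to patch
restored = np.expm1(logged)  # exact inverse transform

print(np.isfinite(logged).all())     # True
print(np.allclose(restored, sales))  # True
```

The trade-off is that forecasts must then be back-transformed with `expm1` instead of `exp`.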
In [31]:
# Plot the log again after we replaced inf, -inf with 0.

_ = map(plot_data, splitted_data, splitted_data_log, temp_list, \
        ['Log']*len(temp_list), ['original']*len(temp_list), ['logged']*len(temp_list))
In [33]:
# Differencing to deal with the trend.
# Log first difference

def diff_data(timeseries, item):
    timeseries_diff = timeseries - timeseries.shift()
    return timeseries_diff

splitted_data_log_diff = map(diff_data, splitted_data_log, temp_list)
In [34]:
print splitted_data_log_diff[0].head()
2015-09-26         NaN
2015-09-27    0.396809
2015-09-28    1.412017
2015-09-29    0.168415
2015-09-30   -0.078987
Freq: D, Name: sales, dtype: float64

Because we shifted the time series, the first value becomes NaN.

We must drop the NaNs for the Dickey-Fuller test to work.

In [35]:
_ = map(lambda x: x.dropna(inplace=True), splitted_data_log_diff)
_ = map(test_stationarity, splitted_data_log_diff, temp_list)
Results of Dickey-Fuller Test:
Test Statistic                -5.867731e+00
p-value                        3.295171e-07
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.379569e+00
p-value                        2.239677e-08
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.283016e+00
p-value                        3.756192e-08
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -4.358665
p-value                          0.000351
#Lags Used                      14.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -5.491730
p-value                          0.000002
#Lags Used                      13.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -5.973994e+00
p-value                        1.906609e-07
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.126546e+00
p-value                        8.601670e-08
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.236673e+00
p-value                        4.806850e-08
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -5.512843
p-value                          0.000002
#Lags Used                      13.000000
Number of Observations Used    356.000000
Critical Value (5%)             -2.869693
Critical Value (1%)             -3.448853
Critical Value (10%)            -2.571114
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -5.994589e+00
p-value                        1.713572e-07
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -5.757345e+00
p-value                        5.778576e-07
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -5.875896e+00
p-value                        3.160216e-07
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.529351e+00
p-value                        9.959692e-09
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.117129e+00
p-value                        9.037992e-08
#Lags Used                     1.300000e+01
Number of Observations Used    3.560000e+02
Critical Value (5%)           -2.869693e+00
Critical Value (1%)           -3.448853e+00
Critical Value (10%)          -2.571114e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -5.592441
p-value                          0.000001
#Lags Used                      14.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                  -5.386274
p-value                          0.000004
#Lags Used                      14.000000
Number of Observations Used    355.000000
Critical Value (5%)             -2.869716
Critical Value (1%)             -3.448906
Critical Value (10%)            -2.571126
dtype: float64

- Looks good! Most series have become stationary

- Now let's deal with the seasonality

In [38]:
# Differencing to deal with seasonality...
# Log seasonal difference

splitted_data_log_diff_season = map(lambda x: x - x.shift(7), splitted_data_log_diff)
_ = map(lambda x: x.dropna(inplace=True), splitted_data_log_diff_season)
In [39]:
# Let's see the Dickey-Fuller test after taking the log, diff, & season diff...

_ = map(test_stationarity, splitted_data_log_diff_season, temp_list)
Results of Dickey-Fuller Test:
Test Statistic                -7.089922e+00
p-value                        4.437585e-10
#Lags Used                     1.700000e+01
Number of Observations Used    3.450000e+02
Critical Value (5%)           -2.869954e+00
Critical Value (1%)           -3.449447e+00
Critical Value (10%)          -2.571253e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.858329e+00
p-value                        5.367554e-12
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.946767e+00
p-value                        3.204447e-12
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.592978e+00
p-value                        7.038593e-09
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -6.928629e+00
p-value                        1.098679e-09
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.702990e+00
p-value                        1.324068e-11
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.636023e+00
p-value                        1.951488e-11
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.875851e+00
p-value                        4.846561e-12
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.034861e+00
p-value                        6.052888e-10
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -8.006229e+00
p-value                        2.263876e-12
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -5.982565e+00
p-value                        1.823813e-07
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.669657e+00
p-value                        1.606231e-11
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.987777e+00
p-value                        2.521748e-12
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -7.668630e+00
p-value                        1.615808e-11
#Lags Used                     1.600000e+01
Number of Observations Used    3.460000e+02
Critical Value (5%)           -2.869930e+00
Critical Value (1%)           -3.449392e+00
Critical Value (10%)          -2.571240e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -8.012994e+00
p-value                        2.176062e-12
#Lags Used                     1.500000e+01
Number of Observations Used    3.470000e+02
Critical Value (5%)           -2.869906e+00
Critical Value (1%)           -3.449337e+00
Critical Value (10%)          -2.571227e+00
dtype: float64
Results of Dickey-Fuller Test:
Test Statistic                -8.186785e+00
p-value                        7.860575e-13
#Lags Used                     1.400000e+01
Number of Observations Used    3.480000e+02
Critical Value (5%)           -2.869881e+00
Critical Value (1%)           -3.449282e+00
Critical Value (10%)          -2.571214e+00
dtype: float64

All the test statistics are well below even the 1% critical values, with p-values effectively zero, so we reject the unit-root null for every category: the tested series are stationary.

Next, we visualize the ACF/PACF plots to decide whether an AR, MA, or ARMA model is appropriate.

Step 3.2) ACF/PACF Plots for

- Observed data

- Log transformed data

- Diff Log transformed data

- Seasonal Diff Log transformed data
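These transformed series can be built from a raw daily series along these lines (a sketch; `daily_sales` is a stand-in for one category's daily sales, not the notebook's variable):

```python
import numpy as np
import pandas as pd

# Stand-in for one category's daily sales series.
daily_sales = pd.Series([100.0, 120.0, 90.0, 110.0, 130.0, 95.0, 105.0, 115.0])

log_sales = np.log(daily_sales)               # log transform
log_diff = log_sales.diff().dropna()          # first difference of the log
log_diff_season = log_sales.diff(7).dropna()  # weekly (lag-7) seasonal difference
```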

In [40]:
def acf_pacf_plot(timeseries, title, item, lags=30):
    fig = plt.figure(figsize=(12,8))
    ax1 = fig.add_subplot(211)
    fig = sm.graphics.tsa.plot_acf(timeseries, lags=lags, ax=ax1)
    ax2 = fig.add_subplot(212)
    fig = sm.graphics.tsa.plot_pacf(timeseries, lags=lags, ax=ax2)
    fig.suptitle('Category {} - {} ACF/PACF Plot'.format(item, title), fontsize=20)
In [41]:
# Observed data ACF/PACF plot.

_ = map(acf_pacf_plot, splitted_data, ['Observed Data']*16, temp_list)

We observe the following:

- A), B), C), D), E), F), G), H), I), K), L), M), N), O), P): both ACF and PACF tail off

- Q): PACF cuts off at lag 1, suggesting an AR(1) model

In [42]:
# Let's see the log transformed ACF/PACF plot.

_ = map(acf_pacf_plot, splitted_data_log, ['Log']*16, temp_list)

We observe the following:

- A), D), E), F), G), H), I), L), P): both ACF and PACF tail off

- B), C), K), M), N), O), Q): PACF cuts off at lag 1

Some of the cut-offs are not clear-cut; these may be AR(1) processes.

In [43]:
# Diff log ACF/PACF plot.

_ = map(acf_pacf_plot, splitted_data_log_diff, ['Diff Log']*16, temp_list)

- For the differenced log series, both ACF and PACF tail off, which is ARMA behavior.

- So the observed data may be fitted with an ARIMA model.

In [44]:
# Finally, let's see the Season Diff Log transformed ACF/PACF plot.

_ = map(acf_pacf_plot, splitted_data_log_diff_season, ['Season Diff Log']*len(temp_list), temp_list, [range(0, 182, 7)]*16)

We observe the following:

- For all categories, the seasonal ACF cuts off at lag 1, suggesting SMA(1), i.e. a seasonal (0, 1, 1) component.

- Some of the cut-offs are not clear-cut.

No worries: we will use a grid search to find the best-fitting model.

Step 4) Model Building

- At this point we are fairly confident we need MA and seasonal MA terms.

- Whether AR and seasonal AR terms are needed is still uncertain.

- The order of differencing is 1 for both the regular and the seasonal component.

- We will use a grid search to determine the AR and seasonal AR orders.

In [45]:
# We will use this function to search for the best p & q values, and seasonal P & Q values.
# Four nested for loops look expensive, but each of p, q, P and Q takes only 1 or 2 values,
# because the ACF/PACF plots already suggest these orders will not exceed 2.

def evaluate_models(dataset, p_values, q_values, P_values, Q_values):
    best_score, best_cfg, best_season = float("inf"), None, None
    for p in p_values:
        for q in q_values:
            for P in P_values:
                for Q in Q_values:
                    order = (p,1,q)
                    season_order = (P, 1, Q, 7)
                    mod = sm.tsa.statespace.SARIMAX(dataset, trend='n', order=order,
                                                    seasonal_order=season_order)
                    result = mod.fit()
                    aic = result.aic
                    if aic < best_score:
                        best_score, best_cfg, best_season = aic, order, season_order
                    print('SARIMA%s %s AIC=%.3f' % (order, season_order, aic))

    print('Best SARIMA%s %s AIC=%.3f' % (best_cfg, best_season, best_score))
    return [best_cfg, best_season]
In [46]:
def model_building(timeseries, item, p=range(0,2), q=range(1, 2), P=range(0,2), Q=range(1, 2)):
    orders = evaluate_models(timeseries, p, q, P, Q)
    mod = sm.tsa.statespace.SARIMAX(timeseries, trend='n', order=orders[0], seasonal_order=orders[1])
    results_1 = mod.fit()
    
    print 'Category {}: '.format(item), results_1.summary()
    return results_1
In [47]:
results_1 = map(model_building, splitted_data, temp_list)
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=4215.472
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=4206.725
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4190.120
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4176.523
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4176.523
Category A:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2083.261
Date:                           Fri, 17 Feb 2017   AIC                           4176.523
Time:                                   03:53:15   BIC                           4196.104
Sample:                               09-26-2015   HQIC                          4184.299
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6234      0.048     13.044      0.000       0.530       0.717
ma.L1         -0.9306      0.027    -35.034      0.000      -0.983      -0.879
ar.S.L7        0.3666      0.052      6.996      0.000       0.264       0.469
ma.S.L7       -0.8887      0.045    -19.801      0.000      -0.977      -0.801
sigma2      5529.4423    270.603     20.434      0.000    4999.070    6059.815
===================================================================================
Ljung-Box (Q):                       36.60   Jarque-Bera (JB):               327.22
Prob(Q):                              0.62   Prob(JB):                         0.00
Heteroskedasticity (H):               0.62   Skew:                            -0.08
Prob(H) (two-sided):                  0.01   Kurtosis:                         7.65
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
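As a quick sanity check, the reported AIC is consistent with the log likelihood via AIC = 2k - 2*logL, with k = 5 estimated parameters in this model:

```python
# Category A: 5 estimated parameters (ar.L1, ma.L1, ar.S.L7, ma.S.L7, sigma2)
k, loglik = 5, -2083.261
aic = 2 * k - 2 * loglik
print(round(aic, 3))  # -> 4176.522, matching the reported 4176.523 up to rounding
```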
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=6243.323
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=6220.134
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=6223.667
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6196.915
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6196.915
Category B:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3093.457
Date:                           Fri, 17 Feb 2017   AIC                           6196.915
Time:                                   03:53:18   BIC                           6216.496
Sample:                               09-26-2015   HQIC                          6204.692
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.5748      0.044     13.209      0.000       0.490       0.660
ma.L1         -0.9151      0.031    -29.470      0.000      -0.976      -0.854
ar.S.L7        0.4673      0.034     13.631      0.000       0.400       0.534
ma.S.L7       -0.9511      0.024    -39.242      0.000      -0.999      -0.904
sigma2      1.431e+06   5.99e+04     23.878      0.000    1.31e+06    1.55e+06
===================================================================================
Ljung-Box (Q):                       53.38   Jarque-Bera (JB):               926.93
Prob(Q):                              0.08   Prob(JB):                         0.00
Heteroskedasticity (H):               0.58   Skew:                            -0.06
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.83
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=5507.428
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=5498.134
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=5493.923
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5483.052
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5483.052
Category C:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2736.526
Date:                           Fri, 17 Feb 2017   AIC                           5483.052
Time:                                   03:53:24   BIC                           5502.633
Sample:                               09-26-2015   HQIC                          5490.829
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6062      0.043     14.124      0.000       0.522       0.690
ma.L1         -0.9019      0.030    -30.052      0.000      -0.961      -0.843
ar.S.L7        0.2907      0.049      5.946      0.000       0.195       0.386
ma.S.L7       -0.8770      0.042    -21.087      0.000      -0.958      -0.795
sigma2      2.022e+05   8282.856     24.416      0.000    1.86e+05    2.18e+05
===================================================================================
Ljung-Box (Q):                       58.57   Jarque-Bera (JB):               682.01
Prob(Q):                              0.03   Prob(JB):                         0.00
Heteroskedasticity (H):               0.49   Skew:                             0.07
Prob(H) (two-sided):                  0.00   Kurtosis:                         9.71
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=4588.618
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=4573.988
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4571.019
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4550.263
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4550.263
Category D:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2270.132
Date:                           Fri, 17 Feb 2017   AIC                           4550.263
Time:                                   03:53:27   BIC                           4569.844
Sample:                               09-26-2015   HQIC                          4558.040
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.5554      0.050     11.048      0.000       0.457       0.654
ma.L1         -0.8896      0.031    -28.869      0.000      -0.950      -0.829
ar.S.L7        0.4297      0.047      9.060      0.000       0.337       0.523
ma.S.L7       -0.9025      0.040    -22.314      0.000      -0.982      -0.823
sigma2      1.551e+04    488.317     31.768      0.000    1.46e+04    1.65e+04
===================================================================================
Ljung-Box (Q):                       43.05   Jarque-Bera (JB):              2334.38
Prob(Q):                              0.34   Prob(JB):                         0.00
Heteroskedasticity (H):               1.02   Skew:                             1.33
Prob(H) (two-sided):                  0.92   Kurtosis:                        15.13
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=4616.554
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=4618.140
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4603.303
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4603.539
Best SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4603.303
Category E:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(0, 1, 1, 7)   Log Likelihood               -2297.651
Date:                           Fri, 17 Feb 2017   AIC                           4603.303
Time:                                   03:53:32   BIC                           4618.968
Sample:                               09-26-2015   HQIC                          4609.524
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.5730      0.069      8.304      0.000       0.438       0.708
ma.L1         -0.8142      0.054    -14.953      0.000      -0.921      -0.708
ma.S.L7       -0.7261      0.031    -23.366      0.000      -0.787      -0.665
sigma2      1.815e+04    853.376     21.272      0.000    1.65e+04    1.98e+04
===================================================================================
Ljung-Box (Q):                       78.03   Jarque-Bera (JB):               237.24
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.36   Skew:                             0.32
Prob(H) (two-sided):                  0.00   Kurtosis:                         6.91
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=7003.378
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=6985.209
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=6972.528
/Library/Python/2.7/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6950.920
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6950.920
/Library/Python/2.7/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Category F:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3470.452
Date:                           Fri, 17 Feb 2017   AIC                           6950.903
Time:                                   03:53:38   BIC                           6970.484
Sample:                               09-26-2015   HQIC                          6958.680
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.7469      0.030     24.600      0.000       0.687       0.806
ma.L1         -0.9676      0.025    -38.781      0.000      -1.017      -0.919
ar.S.L7        0.5411      0.034     15.956      0.000       0.475       0.608
ma.S.L7       -0.9556      0.023    -41.856      0.000      -1.000      -0.911
sigma2      1.146e+07   7.02e-10   1.63e+16      0.000    1.15e+07    1.15e+07
===================================================================================
Ljung-Box (Q):                       58.28   Jarque-Bera (JB):              1036.74
Prob(Q):                              0.03   Prob(JB):                         0.00
Heteroskedasticity (H):               0.52   Skew:                             0.17
Prob(H) (two-sided):                  0.00   Kurtosis:                        11.27
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number  4e+31. Standard errors may be unstable.
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=5710.877
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=5694.221
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=5682.777
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5663.447
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5663.447
Category G:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2826.724
Date:                           Fri, 17 Feb 2017   AIC                           5663.447
Time:                                   03:53:41   BIC                           5683.028
Sample:                               09-26-2015   HQIC                          5671.224
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6657      0.034     19.469      0.000       0.599       0.733
ma.L1         -0.9554      0.022    -43.804      0.000      -0.998      -0.913
ar.S.L7        0.4338      0.045      9.622      0.000       0.345       0.522
ma.S.L7       -0.9006      0.036    -24.883      0.000      -0.972      -0.830
sigma2      3.334e+05   1.36e+04     24.436      0.000    3.07e+05     3.6e+05
===================================================================================
Ljung-Box (Q):                       41.09   Jarque-Bera (JB):               813.22
Prob(Q):                              0.42   Prob(JB):                         0.00
Heteroskedasticity (H):               0.55   Skew:                            -0.06
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.33
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=6102.786
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=6079.633
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=6076.382
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6051.028
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6051.028
Category H:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3020.514
Date:                           Fri, 17 Feb 2017   AIC                           6051.028
Time:                                   03:53:44   BIC                           6070.609
Sample:                               09-26-2015   HQIC                          6058.805
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6137      0.041     15.021      0.000       0.534       0.694
ma.L1         -0.9289      0.024    -38.132      0.000      -0.977      -0.881
ar.S.L7        0.5302      0.026     20.185      0.000       0.479       0.582
ma.S.L7       -0.9993      0.038    -26.508      0.000      -1.073      -0.925
sigma2      9.357e+05   4.09e-08   2.29e+13      0.000    9.36e+05    9.36e+05
===================================================================================
Ljung-Box (Q):                       45.19   Jarque-Bera (JB):              1037.76
Prob(Q):                              0.26   Prob(JB):                         0.00
Heteroskedasticity (H):               0.97   Skew:                             0.35
Prob(H) (two-sided):                  0.86   Kurtosis:                        11.25
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.39e+28. Standard errors may be unstable.
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=5240.439
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=5217.565
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=5215.947
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5191.568
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=5191.568
Category I:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2590.784
Date:                           Fri, 17 Feb 2017   AIC                           5191.568
Time:                                   03:53:47   BIC                           5211.149
Sample:                               09-26-2015   HQIC                          5199.345
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6336      0.041     15.627      0.000       0.554       0.713
ma.L1         -0.9332      0.022    -42.104      0.000      -0.977      -0.890
ar.S.L7        0.5824      0.030     19.462      0.000       0.524       0.641
ma.S.L7       -0.9984      0.378     -2.638      0.008      -1.740      -0.257
sigma2      8.797e+04   3.42e+04      2.571      0.010    2.09e+04    1.55e+05
===================================================================================
Ljung-Box (Q):                       52.73   Jarque-Bera (JB):               631.64
Prob(Q):                              0.09   Prob(JB):                         0.00
Heteroskedasticity (H):               1.28   Skew:                             0.75
Prob(H) (two-sided):                  0.18   Kurtosis:                         9.29
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=4900.609
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=4891.158
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4883.013
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4869.759
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4869.759
Category K:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2429.879
Date:                           Fri, 17 Feb 2017   AIC                           4869.759
Time:                                   03:53:50   BIC                           4889.340
Sample:                               09-26-2015   HQIC                          4877.535
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.4968      0.051      9.749      0.000       0.397       0.597
ma.L1         -0.8760      0.035    -25.083      0.000      -0.944      -0.808
ar.S.L7        0.3202      0.062      5.199      0.000       0.200       0.441
ma.S.L7       -0.8753      0.052    -16.988      0.000      -0.976      -0.774
sigma2      3.742e+04   1143.023     32.740      0.000    3.52e+04    3.97e+04
===================================================================================
Ljung-Box (Q):                       27.90   Jarque-Bera (JB):              2159.29
Prob(Q):                              0.93   Prob(JB):                         0.00
Heteroskedasticity (H):               1.12   Skew:                             0.94
Prob(H) (two-sided):                  0.54   Kurtosis:                        14.80
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=4981.646
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=4969.702
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=4946.950
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4933.440
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=4933.440
Category L:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -2461.720
Date:                           Fri, 17 Feb 2017   AIC                           4933.440
Time:                                   03:53:54   BIC                           4953.021
Sample:                               09-26-2015   HQIC                          4941.216
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6416      0.033     19.548      0.000       0.577       0.706
ma.L1         -0.9614      0.018    -52.709      0.000      -0.997      -0.926
ar.S.L7        0.5258      0.048     10.874      0.000       0.431       0.621
ma.S.L7       -0.9082      0.049    -18.588      0.000      -1.004      -0.812
sigma2      4.462e+04   2201.649     20.266      0.000    4.03e+04    4.89e+04
===================================================================================
Ljung-Box (Q):                       77.66   Jarque-Bera (JB):               582.99
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.67   Skew:                             0.09
Prob(H) (two-sided):                  0.03   Kurtosis:                         9.21
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=7135.872
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=7111.014
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=7136.445
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=7134.404
Best SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=7111.014
Category M:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(0, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3551.507
Date:                           Fri, 17 Feb 2017   AIC                           7111.014
Time:                                   03:53:57   BIC                           7126.678
Sample:                               09-26-2015   HQIC                          7117.235
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ma.L1         -0.8466      0.027    -31.412      0.000      -0.899      -0.794
ar.S.L7        0.4139      0.020     20.861      0.000       0.375       0.453
ma.S.L7       -0.9956      0.035    -28.215      0.000      -1.065      -0.926
sigma2      1.746e+07   2.13e-09   8.21e+15      0.000    1.75e+07    1.75e+07
===================================================================================
Ljung-Box (Q):                       51.19   Jarque-Bera (JB):              2236.02
Prob(Q):                              0.11   Prob(JB):                         0.00
Heteroskedasticity (H):               7.03   Skew:                             2.12
Prob(H) (two-sided):                  0.00   Kurtosis:                        14.39
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.09e+31. Standard errors may be unstable.
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=6681.762
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=6657.668
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=6660.224
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6636.194
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6636.194
Category N:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3313.097
Date:                           Fri, 17 Feb 2017   AIC                           6636.194
Time:                                   03:54:02   BIC                           6655.775
Sample:                               09-26-2015   HQIC                          6643.971
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6234      0.038     16.223      0.000       0.548       0.699
ma.L1         -0.9330      0.029    -32.525      0.000      -0.989      -0.877
ar.S.L7        0.4623      0.035     13.252      0.000       0.394       0.531
ma.S.L7       -0.9521      0.023    -41.481      0.000      -0.997      -0.907
sigma2        4.8e+06   2.06e+05     23.296      0.000     4.4e+06     5.2e+06
===================================================================================
Ljung-Box (Q):                       49.38   Jarque-Bera (JB):               844.74
Prob(Q):                              0.15   Prob(JB):                         0.00
Heteroskedasticity (H):               0.55   Skew:                            -0.08
Prob(H) (two-sided):                  0.00   Kurtosis:                        10.47
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=6618.759
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=6586.700
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=6596.389
/Library/Python/2.7/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6564.018
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=6564.018
/Library/Python/2.7/site-packages/statsmodels/base/model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  "Check mle_retvals", ConvergenceWarning)
Category O:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3277.009
Date:                           Fri, 17 Feb 2017   AIC                           6564.019
Time:                                   03:54:07   BIC                           6583.600
Sample:                               09-26-2015   HQIC                          6571.796
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6590      0.042     15.507      0.000       0.576       0.742
ma.L1         -0.9353      0.026    -36.080      0.000      -0.986      -0.885
ar.S.L7        0.4887      0.030     16.113      0.000       0.429       0.548
ma.S.L7       -0.9980      0.037    -26.810      0.000      -1.071      -0.925
sigma2      3.839e+06      1e-08   3.83e+14      0.000    3.84e+06    3.84e+06
===================================================================================
Ljung-Box (Q):                       48.45   Jarque-Bera (JB):              1062.57
Prob(Q):                              0.17   Prob(JB):                         0.00
Heteroskedasticity (H):               1.04   Skew:                             0.67
Prob(H) (two-sided):                  0.83   Kurtosis:                        11.27
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.26e+29. Standard errors may be unstable.
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=3173.108
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=3152.616
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=3157.552
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=3130.182
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=3130.182
Category P:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -1560.091
Date:                           Fri, 17 Feb 2017   AIC                           3130.182
Time:                                   03:54:11   BIC                           3149.763
Sample:                               09-26-2015   HQIC                          3137.959
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.4638      0.045     10.241      0.000       0.375       0.553
ma.L1         -0.8852      0.023    -38.525      0.000      -0.930      -0.840
ar.S.L7        0.5637      0.040     13.938      0.000       0.484       0.643
ma.S.L7       -0.9995      2.320     -0.431      0.667      -5.547       3.548
sigma2       299.4240    696.276      0.430      0.667   -1065.253    1664.101
===================================================================================
Ljung-Box (Q):                       54.01   Jarque-Bera (JB):               404.64
Prob(Q):                              0.07   Prob(JB):                         0.00
Heteroskedasticity (H):               1.60   Skew:                             0.31
Prob(H) (two-sided):                  0.01   Kurtosis:                         8.14
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=3647.449
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=3642.546
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=3639.113
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=3637.845
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=3637.845
Category Q:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -1813.923
Date:                           Fri, 17 Feb 2017   AIC                           3637.845
Time:                                   03:54:15   BIC                           3657.426
Sample:                               09-26-2015   HQIC                          3645.622
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2269      0.056      4.017      0.000       0.116       0.338
ma.L1         -0.7541      0.042    -17.859      0.000      -0.837      -0.671
ar.S.L7       -0.2520      0.048     -5.220      0.000      -0.347      -0.157
ma.S.L7       -0.5944      0.034    -17.386      0.000      -0.661      -0.527
sigma2      1260.3291     55.382     22.757      0.000    1151.783    1368.876
===================================================================================
Ljung-Box (Q):                       72.89   Jarque-Bera (JB):               236.46
Prob(Q):                              0.00   Prob(JB):                         0.00
Heteroskedasticity (H):               0.07   Skew:                             0.11
Prob(H) (two-sided):                  0.00   Kurtosis:                         6.95
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).

Results

Category E: (1, 1, 1)x(0, 1, 1, 7)

Category M: (0, 1, 1)x(1, 1, 1, 7)

All other categories: (1, 1, 1)x(1, 1, 1, 7)

In [51]:
# Next step: check the residual plots to see whether the models fit well.

def get_residuals(timeseries_results):
    residuals = pd.DataFrame(timeseries_results.resid)
    
    return residuals
In [52]:
residuals = map(get_residuals, results_1)
In [55]:
def residual_analysis(residuals, item):
    fig = plt.figure(figsize=(12, 12))

    ax1 = fig.add_subplot(411)
    sm.graphics.tsa.plot_acf(residuals, lags=30, ax=ax1,
                             title='Category {} - ACF of Residual'.format(item))

    ax2 = fig.add_subplot(412)
    ax2.title.set_text('Category {} - Q-Q Plot of Std Residuals'.format(item))
    qqplot(residuals, line='r', ax=ax2)

    # Plot the raw residual series.
    ax3 = fig.add_subplot(413)
    ax3.title.set_text('Category {} - Standardized Residuals'.format(item))
    ax3.plot(residuals)

    # Ljung-Box test: high p-values suggest no remaining autocorrelation.
    temp_q_stat = sm.stats.diagnostic.acorr_ljungbox(residuals, lags=30)
    ax4 = fig.add_subplot(414)
    ax4.title.set_text('Category {} - P value for Ljung-Box Statistics'.format(item))
    ax4.plot(temp_q_stat[1], 'o')

    plt.tight_layout()
In [56]:
_ = map(residual_analysis, residuals, temp_list)

The residual plots mostly look good. Let's move on to forecasting.

Step 5) In-Sample Prediction and 90-Day-Ahead Forecast

- We will plot the forecast for each category

- Plot the relative contribution from each category

- Add them up and plot the total forecast

- Measure the RMSE

- We will also build a separate model and forecast using a top-down approach for comparison (what we have done so far is bottom-up)

In [59]:
def plot_forecast(results, timeseries, item):
    pred = results.predict()
    forecast_90 = results.forecast(steps=90)

    pred.plot(label='predict')
    forecast_90.plot(label='forecast')
    timeseries.plot(label='original')
    plt.title('Category {} Forecast'.format(item))
    plt.legend()
    plt.show(block=False)

    return pred
In [73]:
pred = map(plot_forecast, results_1, splitted_data, temp_list)
In [74]:
# Report performance
def rmse_measurement(timeseries, pred, item):
    mse = mean_squared_error(timeseries, pred)
    rmse = sqrt(mse)
    print('Category {} - RMSE: {:.3f}'.format(item, rmse))
    return rmse
In [75]:
rmse_total = map(rmse_measurement, splitted_data, pred, temp_list)
Category A - RMSE: 78.730
Category B - RMSE: 1297.897
Category C - RMSE: 484.574
Category D - RMSE: 130.402
Category E - RMSE: 148.935
Category F - RMSE: 3682.639
Category G - RMSE: 628.287
Category H - RMSE: 1038.580
Category I - RMSE: 317.536
Category K - RMSE: 202.394
Category L - RMSE: 225.621
Category M - RMSE: 4310.820
Category N - RMSE: 2391.483
Category O - RMSE: 2125.306
Category P - RMSE: 17.857
Category Q - RMSE: 37.184
In [78]:
print 'Total RMSE: ', np.sum(rmse_total)
Total RMSE:  17118.2432938
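A side note on this number: it is the *sum* of the per-category RMSEs, which by the triangle inequality always over-states the error of the aggregated forecast, so it is not directly comparable to an RMSE computed on the total series. A quick numpy illustration on hypothetical data:

```python
import numpy as np

rng = np.random.RandomState(0)
actual = rng.rand(3, 100) * 100        # three hypothetical categories, 100 days
pred = actual + rng.randn(3, 100) * 5  # forecasts with independent errors


def rmse(a, b):
    return np.sqrt(np.mean((a - b) ** 2))


sum_of_rmses = sum(rmse(a, p) for a, p in zip(actual, pred))
rmse_of_sum = rmse(actual.sum(axis=0), pred.sum(axis=0))

# rmse_of_sum is never larger than sum_of_rmses (triangle inequality on L2).
print(sum_of_rmses, rmse_of_sum)
```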

Total Forecast

In [79]:
forecast_90 = map(lambda x: x.forecast(steps=90), results_1)

total_forecast = forecast_90[0]
original_sales = splitted_data[0]
for fc, orig in zip(forecast_90[1:], splitted_data[1:]):
    total_forecast = total_forecast + fc
    original_sales = original_sales + orig
        
plt.plot(original_sales.index, original_sales, '-o')
plt.plot(total_forecast.index, total_forecast, '-o')
Out[79]:
[<matplotlib.lines.Line2D at 0x10be4cf50>]
In [80]:
def relative_cont_plot(forecast_90, item):
    category = forecast_90 / total_forecast
#     print 'Category {}: '.format(item), category.mean()
    if category.mean() >= 0.08:
        print 'Category {} Sales Contribution: '.format(item), category.mean()
    _ = plt.plot(category.index, category, '-o', label=item)
    _ = plt.legend(loc='right')
    _ = plt.title('Relative Contribution from Each Category')

print 'Top contributors are...'
_ = map(relative_cont_plot, forecast_90, temp_list)
Top contributors are...
Category B Sales Contribution:  0.0910972385491
Category F Sales Contribution:  0.197065185693
Category M Sales Contribution:  0.152004401953
Category N Sales Contribution:  0.155479167606
Category O Sales Contribution:  0.142705634591

Top Down Approach - Build Model and Forecast on All Categories.

In [81]:
print original_sales.head()
idx = pd.date_range('09-26-2015', '09-30-2016')    

original_sales.index = pd.DatetimeIndex(original_sales.index)
original_sales = original_sales.reindex(idx, fill_value=0)
print original_sales.shape, '-', original_sales.index

print '\n', original_sales.head()
2015-09-26    14588.23
2015-09-27    15143.61
2015-09-28    84535.83
2015-09-29    94145.18
2015-09-30    91173.07
Freq: D, Name: sales, dtype: float64
(371,) - DatetimeIndex(['2015-09-26', '2015-09-27', '2015-09-28', '2015-09-29',
               '2015-09-30', '2015-10-01', '2015-10-02', '2015-10-03',
               '2015-10-04', '2015-10-05',
               ...
               '2016-09-21', '2016-09-22', '2016-09-23', '2016-09-24',
               '2016-09-25', '2016-09-26', '2016-09-27', '2016-09-28',
               '2016-09-29', '2016-09-30'],
              dtype='datetime64[ns]', length=371, freq='D')

2015-09-26    14588.23
2015-09-27    15143.61
2015-09-28    84535.83
2015-09-29    94145.18
2015-09-30    91173.07
Freq: D, Name: sales, dtype: float64
In [82]:
total_forecast = model_building(original_sales, 'Total')
SARIMA(0, 1, 1) (0, 1, 1, 7) AIC=8030.213
SARIMA(0, 1, 1) (1, 1, 1, 7) AIC=8008.632
SARIMA(1, 1, 1) (0, 1, 1, 7) AIC=8013.144
SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=7993.152
Best SARIMA(1, 1, 1) (1, 1, 1, 7) AIC=7993.152
Category Total:                                   Statespace Model Results                                
=========================================================================================
Dep. Variable:                             sales   No. Observations:                  371
Model:             SARIMAX(1, 1, 1)x(1, 1, 1, 7)   Log Likelihood               -3991.576
Date:                           Fri, 17 Feb 2017   AIC                           7993.152
Time:                                   04:11:21   BIC                           8012.733
Sample:                               09-26-2015   HQIC                          8000.929
                                    - 09-30-2016                                         
Covariance Type:                             opg                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.6595      0.060     11.052      0.000       0.543       0.776
ma.L1         -0.9290      0.045    -20.503      0.000      -1.018      -0.840
ar.S.L7        0.4685      0.053      8.884      0.000       0.365       0.572
ma.S.L7       -0.9476      0.033    -28.527      0.000      -1.013      -0.882
sigma2       2.67e+08   2.76e-11   9.68e+18      0.000    2.67e+08    2.67e+08
===================================================================================
Ljung-Box (Q):                       49.54   Jarque-Bera (JB):               577.06
Prob(Q):                              0.14   Prob(JB):                         0.00
Heteroskedasticity (H):               0.98   Skew:                             0.34
Prob(H) (two-sided):                  0.92   Kurtosis:                         9.14
===================================================================================

Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
[2] Covariance matrix is singular or near-singular, with condition number 1.1e+35. Standard errors may be unstable.
In [83]:
total_pred = total_forecast.predict()
forecast_90 = total_forecast.forecast(steps=90)

total_pred.plot(label='pred')
original_sales.plot(label='original')
forecast_90.plot(label='forecast')
plt.legend()
Out[83]:
<matplotlib.legend.Legend at 0x10bc014d0>
In [85]:
rmse_measurement(original_sales, total_pred, 'Total')
Category Total - RMSE: 14958.273
Out[85]:
14958.272530736826

Summary

1) Statistical tests suggest

- There are significant autocorrelation lags and weekly seasonal effects in the data

- So we need a seasonal ARIMA (SARIMA) model as opposed to a regular ARIMA model

- The SARIMA models for each category will be slightly different, but shouldn't deviate too much from order 1

2) SARIMA models were used for forecasting each category

- Category E: (1, 1, 1)x(0, 1, 1, 7)

- Category M: (0, 1, 1)x(1, 1, 1, 7)

- All other categories: (1, 1, 1)x(1, 1, 1, 7)

3) We also used a top-down approach (forecast total sales directly, without breaking it down by category)

- Top-down approach (RMSE = 14,958)

- Bottom-up approach (RMSE = 17,118)

- The top-down approach seems to generalize the time series better, which is interesting because one would expect the opposite

- But the top-down model could also be overfitting, so further validation is needed
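One way to carry out that further validation is a holdout split: fit on everything except the last 90 days and score the out-of-sample forecast. The sketch below uses a synthetic stand-in series and a seasonal-naive forecast (repeat the last observed week) instead of refitting SARIMA, just to show the mechanics.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the daily total sales series (371 days, weekly cycle).
idx = pd.date_range('2015-09-26', periods=371, freq='D')
rng = np.random.RandomState(1)
sales = pd.Series(50000 + 5000 * np.sin(2 * np.pi * np.arange(371) / 7)
                  + rng.randn(371) * 1000, index=idx)

# Holdout split: train on all but the last 90 days.
train, test = sales[:-90], sales[-90:]

# Seasonal-naive forecast for the holdout: tile the last observed week.
last_week = train[-7:].values
forecast = pd.Series(np.tile(last_week, 13)[:90], index=test.index)

holdout_rmse = np.sqrt(np.mean((test.values - forecast.values) ** 2))
print('Holdout RMSE: {:.1f}'.format(holdout_rmse))
```

The same split would let the bottom-up and top-down SARIMA forecasts be compared on genuinely unseen data rather than in-sample predictions.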

4) What could be improved?

- More data: with several years of data we could also capture yearly seasonality and likely build more accurate models

- More features: we could use weather data to see how climate affects sales in different categories

- More features: flags such as promotion periods, or the number of competitors nearby, would also help

- Different models: we could try gradient-boosted trees such as XGBoost
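To use a tree-based model, the series would first be reframed as supervised learning with lag and calendar features. A minimal sketch on synthetic data, using scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost (the lag choices and features here are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic daily series standing in for one category's sales.
rng = np.random.RandomState(2)
idx = pd.date_range('2015-09-26', periods=371, freq='D')
sales = pd.Series(1000 + 200 * np.sin(2 * np.pi * np.arange(371) / 7)
                  + rng.randn(371) * 50, index=idx)

# Frame as supervised learning: lagged sales plus a calendar feature.
df = pd.DataFrame({'sales': sales})
for lag in (1, 7, 14):
    df['lag_{}'.format(lag)] = df['sales'].shift(lag)
df['dayofweek'] = df.index.dayofweek
df = df.dropna()

X, y = df.drop(columns='sales'), df['sales']
X_train, X_test = X[:-90], X[-90:]
y_train, y_test = y[:-90], y[-90:]

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
gbm_rmse = np.sqrt(np.mean((model.predict(X_test) - y_test.values) ** 2))
print('Lag-feature GBM holdout RMSE: {:.1f}'.format(gbm_rmse))
```

Note that a genuine 90-day-ahead forecast would require predicting recursively (feeding predictions back in as lags), since future lag values are not observed at forecast time.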
